10 research outputs found

    Using a parallel programming model for architecture synthesis

    The problem of automatically generating hardware modules from high-level application representations has been at the forefront of EDA research during the last few years. In this dissertation we introduce a methodology to automatically synthesize hardware accelerators from OpenCL applications. OpenCL is a recent industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our methodology maps OpenCL kernels into hardware accelerators based on architectural templates that explicitly decouple computation from memory communication whenever this is possible. The templates can be tuned to provide a wide repertoire of accelerators that meet user performance requirements and FPGA device characteristics. Furthermore, a set of high- and low-level compiler optimizations is applied to generate optimized accelerators. Our experimental evaluation shows that the generated accelerators are tuned efficiently to match the application's memory access pattern and computational complexity and to achieve user performance requirements. An important objective of our tool is to expand the FPGA development user base to software engineers, so that FPGA systems can be developed without hardware design experience, thereby expanding the scope of FPGAs beyond the realm of hardware design.

    Using a parallel programming model for architecture synthesis

    The problem of automatically generating hardware modules from high-level application representations has been at the forefront of EDA research during the last few years. In this dissertation we introduce a methodology to automatically synthesize hardware accelerators from OpenCL applications. OpenCL is a recent industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our methodology maps OpenCL kernels into hardware accelerators based on architectural templates that explicitly decouple computation from memory communication whenever this is possible. The templates can be tuned to provide a wide repertoire of accelerators that meet user performance requirements and FPGA device characteristics. Furthermore, a set of high- and low-level compiler optimizations is applied to generate optimized accelerators. Our experimental evaluation shows that the generated accelerators are tuned efficiently to match the application's memory access pattern and computational complexity and to achieve user performance requirements. An important objective of our tool is to expand the FPGA development user base to software engineers, so that FPGA systems can be developed without hardware design experience, thereby expanding the scope of FPGAs beyond the realm of hardware design.

    Design and implementation of a high-performance architecture for the H.264 transform and quantization algorithms

    Efficient digital video coding techniques are gaining importance due to the widespread use of low bit rate video streaming applications (such as videotelephony and videoconferencing). This raises the need for an industry standard for compressed video representation with substantially increased coding efficiency and enhanced robustness to network environments. In 2001, the Joint Video Team (JVT) was formed to represent the cooperation between the ITU-T Video Coding Expert Group (VCEG) and the ISO/IEC Moving Picture Expert Group (MPEG), aiming for the development of a new standard. The JVT's aim was to finalize the H.26L proposal and convert it into an international standard (H.264/MPEG-4 Part 10) published by both ISO/IEC and ITU-T. H.264 provides similar functionality to earlier standards such as H.263+ and MPEG-4 Visual (Simple Profile), but with significantly better compression performance and improved support for reliable transmission. It does not use the traditional 8x8 DCT as the basic transform. Instead, it specifies a 4x4 DCT-based transform that can be implemented using only integer addition and shift units, avoiding multiplication. In this project, a hardware prototype is designed for the H.264 quantization and the supported transform variants (core transform, 2x2 and 4x4 Hadamard transforms). The inverse transform and quantization path is also considered. The architecture is prototyped and simulated using ModelSim 6.1® and synthesized using Synopsys Design Compiler®.
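
    The add-and-shift property mentioned in the abstract above can be illustrated in software. The following is a minimal sketch, assuming the widely published H.264 4x4 forward core transform matrix with rows (1, 1, 1, 1), (2, 1, -1, -2), (1, -1, -1, 1), (1, -2, 2, -1). The function names are hypothetical, the scaling and quantization stage is omitted, and the code does not represent the hardware architecture of the project.

```python
def butterfly4(x0, x1, x2, x3):
    """1-D 4-point core transform using only additions, subtractions and shifts."""
    s0, s1 = x0 + x3, x1 + x2      # sums of the outer and inner pairs
    d0, d1 = x0 - x3, x1 - x2      # differences of the outer and inner pairs
    return (s0 + s1,               # row (1,  1,  1,  1)
            (d0 << 1) + d1,        # row (2,  1, -1, -2)
            s0 - s1,               # row (1, -1, -1,  1)
            d0 - (d1 << 1))        # row (1, -2,  2, -1)


def forward_core_transform(block):
    """Apply the 1-D butterfly to every row, then to every column, of a 4x4 block."""
    rows = [butterfly4(*r) for r in block]
    cols = [butterfly4(*c) for c in zip(*rows)]
    # cols[j] holds transformed column j; transpose back to row-major order.
    return [list(r) for r in zip(*cols)]


if __name__ == "__main__":
    flat = [[1, 1, 1, 1] for _ in range(4)]
    # A constant block puts all of its energy into the DC coefficient.
    print(forward_core_transform(flat))
```

    Doubling is written as a left shift to mirror the adder/shifter datapath described in the abstract; in Python, `<<` on a negative integer behaves as multiplication by two, so the result matches the matrix product.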

    Single-Pass Covariance Matrix Calculation on a Hybrid FPGA/CPU Platform

    Covariance matrices are used for a wide range of applications in particle physics, including the Kálmán filter for tracking and Principal Component Analysis for dimensionality reduction. Based on a novel decomposition of the covariance matrix, a design that requires only one pass over the data to calculate the covariance matrix is presented. Two computation engines are used, depending on the parallelizability of the necessary computation steps. The design is implemented on a hybrid FPGA/CPU system and yields a speed-up of up to 5 orders of magnitude compared to a previous FPGA implementation.
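
    To illustrate what a single pass over the data means here, the following is a minimal sketch that assumes the textbook identity cov(x, y) = (Σxy − Σx·Σy / n) / (n − 1). This is not necessarily the decomposition proposed in the paper, which the abstract does not specify. The function name is hypothetical, and this sum-of-products formulation can lose precision when the means are large relative to the variances.

```python
def single_pass_covariance(samples):
    """Sample covariance matrix from one pass over an iterable of equal-length vectors."""
    n = 0
    sums = None    # running sum of each component
    prods = None   # running sum of pairwise products x[i] * x[j]
    for x in samples:
        if sums is None:
            d = len(x)
            sums = [0.0] * d
            prods = [[0.0] * d for _ in range(d)]
        n += 1
        for i in range(d):
            sums[i] += x[i]
            for j in range(d):
                prods[i][j] += x[i] * x[j]
    if n < 2:
        raise ValueError("at least two samples are required")
    # Combine the accumulators once at the end; the data is never revisited.
    return [[(prods[i][j] - sums[i] * sums[j] / n) / (n - 1) for j in range(d)]
            for i in range(d)]


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    print(single_pass_covariance(data))
```

    Because the accumulators are independent sums, the two inner loops are the kind of step that parallelizes well, while the final combination is a small sequential step.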

    Accelerating Pattern Matching Queries in Hybrid CPU-FPGA Architectures

    Relational databases execute user queries through operator trees, where each operator has a well-defined interface and a specific task (e.g., arithmetic function, pattern matching, aggregation). Hardware acceleration of compute-intensive operators is a promising prospect, but it comes with challenges. Databases execute tens of thousands of different queries per second. Thus, if only one specific instantiation of an operator is supported by the accelerator, it will have little effect on the overall workload. In this paper we explore the tradeoff between resource efficiency and expression complexity for an FPGA accelerator targeting string-matching operators (LIKE and REGEXP_LIKE in SQL). This tradeoff is complex. For instance, the FPGA does not always win: simple queries that can be answered from indexes run faster on the CPU. On complex regular expressions, the FPGA is faster but needs to be parametrized at runtime to support different queries. For very long patterns, the entire expression might not fit into the FPGA circuit and a combined CPU-FPGA mode must be chosen. We evaluate our design on a heterogeneous multi-core machine in which the FPGA has cache-coherent access to the CPU memory. In addition to the string matching circuit, we also show how to implement database page parsing logic so that the accelerator can work directly on the same in-memory data structures as the database engine.

    doppioDB: A hardware accelerated database

    Relational databases provide a wealth of functionality to a wide range of applications. Yet, there are tasks for which they are less than optimal, for instance when processing becomes more complex (e.g., regular expression evaluation, data analytics) or the data is less structured (e.g., text or long strings). With the increasing amount of user-generated data stored in relational databases, there is a growing need to analyze unstructured text data. At the same time, more complex analytical operators are required to extract useful information from the vast amount of collected data. However, many analytical operators incur a significant compute complexity that is not suitable for database engines where multiple queries share the available resources. In this demonstration we show the benefit of using specialized hardware for such tasks and highlight the importance of a flexible, reusable mechanism for extending database engines with hardware-based operators. Our hybrid database engine, doppioDB, is deployed on an emerging Xeon+FPGA multicore architecture where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables. The demonstration illustrates the acceleration benefits of hardware operators, as well as doppioDB's flexibility in accommodating changing workloads.

    doppioDB 1.0: Machine Learning inside a Relational Engine

    Advances in hardware are a challenge but also a new opportunity. In particular, devices like FPGAs and GPUs are a chance to extend and customize relational engines with new operations that would be difficult to support otherwise. Doing so would offer database users the possibility of conducting, e.g., complete data analyses involving machine learning inside the database instead of having to take the data out, process it on a different platform, and then store the results back in the database, as is often done today. In this paper we present doppioDB 1.0, an FPGA-enabled database engine incorporating FPGA-based machine learning operators into a main-memory, columnar DBMS (MonetDB). This first version of doppioDB provides a platform for extending traditional relational processing with customizable hardware to support stochastic gradient descent and decision tree ensembles. Using these operators, we show examples of how they could be included in SQL and embedded as part of conventional components of a relational database engine. While these results are still a preliminary, exploratory step, they illustrate the challenges to be tackled and the advantages of using hardware accelerators as a way to extend database functionality in a non-disruptive manner.

    doppioDB: A Hardware Accelerated Database

    Relational databases provide a wealth of functionality to a wide range of applications. Yet, there are tasks for which they are less than optimal, for instance when processing becomes more complex (e.g., matching regular expressions) or the data is less structured (e.g., text or long strings). In this demonstration we show the benefit of using specialized hardware for such tasks and highlight the importance of a flexible, reusable mechanism for extending database engines with hardware-based operators. We present doppioDB, which consists of MonetDB, a main-memory column store, extended with Hardware User Defined Functions (HUDFs). In our demonstration the HUDFs are used to provide seamless acceleration of two string operators, LIKE and REGEXP_LIKE, and two analytics operators, SKYLINE and SGD (stochastic gradient descent). We evaluate doppioDB on an emerging hybrid multicore architecture, the Intel Xeon+FPGA platform, where the CPU and FPGA have cache-coherent access to the same memory, such that the hardware operators can directly access the database tables. For integration we rely on HUDFs as the unit of scheduling and management on the FPGA. In the demonstration we show the acceleration benefits of hardware operators, as well as their flexibility in accommodating changing workloads.

    Lowering the Latency of Data Processing Pipelines Through FPGA based Hardware Acceleration

    Web search engines often involve a complex pipeline of processing stages, including computing, scoring, and ranking potential answers, plus returning the sorted results. The latency of such pipelines can be improved by minimizing data movement, making stages faster, and merging stages. The throughput is determined by the stage with the smallest capacity, and it can be improved by allocating enough parallel resources to each stage. In this paper we explore the possibility of employing hardware acceleration (an FPGA) as a way to improve the overall performance when computing answers to search queries. With a real use case as a baseline and motivation, we focus on accelerating the scoring function implemented as a decision tree ensemble, a common approach to scoring and classification in search systems. Our solution uses a novel decision tree ensemble implementation on an FPGA to: 1) increase the number of entries that can be scored per unit of time, and 2) provide a compact implementation that can be combined with previous stages. The resulting system, tested in Amazon F1 instances, significantly improves the quality of the search results and improves performance by two orders of magnitude over the existing CPU-based solution. ISSN:2150-809
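
    As a software reference point for the scoring stage described above, the following is a minimal sketch of scoring one entry with a decision tree ensemble, assuming an additive ensemble in which the score is the sum of the leaf values reached in each tree. The node layout and function names are hypothetical and do not reflect the paper's FPGA implementation.

```python
def score_tree(nodes, features):
    """Walk one tree stored as a flat list of nodes and return the leaf value.

    Internal node: (feature_index, threshold, left_child, right_child)
    Leaf node:     (value,)
    """
    i = 0  # the root is assumed to be node 0
    while True:
        node = nodes[i]
        if len(node) == 1:
            return node[0]
        feat, thr, left, right = node
        i = left if features[feat] <= thr else right


def score_ensemble(trees, features):
    """Additive ensemble: sum the leaf values reached in every tree."""
    return sum(score_tree(t, features) for t in trees)


if __name__ == "__main__":
    tree_a = [(0, 0.5, 1, 2), (1.0,), (2.0,)]
    tree_b = [(1, 3.0, 1, 2), (0.25,), (-0.25,)]
    print(score_ensemble([tree_a, tree_b], [0.2, 5.0]))
```

    Each tree is evaluated independently of the others, which is what makes the scoring step a natural candidate for allocating parallel resources.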

    Enhancing design space exploration by extending CPU/GPU specifications onto FPGAs

    The design cycle for complex special-purpose computing systems is extremely costly and time-consuming. It involves a multiparametric design space exploration for optimization, followed by design verification. Designers of special purpose VLSI implementations often need to explore parameters, such as optimal bitwidth and data representation, through time-consuming Monte Carlo simulations. A prominent example of this simulation-based exploration process is the design of decoders for error correcting systems, such as the Low-Density Parity-Check (LDPC) codes adopted by modern communication standards, which involves thousands of Monte Carlo runs for each design point. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The exploitation of diverse target architectures is typically associated with developing multiple code versions, often using distinct programming paradigms. In this context, we evaluate the concept of retargeting a single OpenCL program to multiple platforms, thereby significantly reducing design time. A single OpenCL-based parallel kernel is used without modifications or code tuning on multicore CPUs, GPUs, and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL in order to introduce FPGAs as a potential platform to efficiently execute simulations coded in OpenCL. We use LDPC decoding simulations as a case study. Experimental results were obtained by testing a variety of regular and irregular LDPC codes that range from short/medium (e.g., 8,000 bit) to long length (e.g., 64,800 bit) DVB-S2 codes. We observe that, depending on the design parameters to be simulated, on the dimension and phase of the design, the GPU or FPGA may suit different purposes more conveniently, thus providing different acceleration factors over conventional multicore CPUs. © 2015 AC